Abstract

Splines are useful building blocks when constructing priors on nonparametric models indexed by functions. It has recently been established in the literature that hierarchical priors based on splines with a random number of equally spaced knots and random coefficients in the B-spline basis corresponding to those knots lead, under certain conditions, to adaptive posterior contraction rates over certain smoothness classes of functions. In this paper we extend these results to the case where the location of the knots is also endowed with a prior. This has already been common practice in MCMC applications, where the resulting posterior is expected to be more "spatially adaptive", but a theoretical basis in terms of adaptive contraction rates was missing. Under some mild assumptions, we establish a result that provides sufficient conditions for adaptive contraction rates in a range of models.

Keywords: Adaptive estimation, Bayesian non-parametric, optimal contraction rate, spline, random knots.



1 Introduction 

The Bayesian approach in statistics has become quite popular in recent years as an alternative to classical frequentist methods. The main appeal of the Bayesian methodology is its conceptual simplicity: given a model for the observed data $X \sim P_f$, $f \in \mathcal{F}$, some space of functions, put a prior on the unknown parameter $f$ and draw inferences based on the resulting posterior $\Pi(f \mid X)$. Knowledge about the model under study can also be incorporated into the inference procedure via the prior. However, some seemingly "correct" priors can lead to unreasonable posteriors, especially in nonparametric models. It is therefore desirable to place ourselves in a setting where it is possible to assess the quality of the resulting posterior from some objective point of view.



This gave rise to the development of the notion of contraction rate (cf. Ghosal et al. (2000)), a Bayesian analog of a convergence rate: data is assumed to come from a fixed probability measure $P_0 = P_{f_0}$ for a "true" $f_0 \in \mathcal{F}$; the contraction rate is then the smallest radius such that the posterior mass in a Hellinger ball of probability measures around $P_0$ converges to 1 in $P_0$-probability as some information index such as a sample size goes to infinity.

Some general results about posterior contraction rates establish sufficient conditions on prior distributions such that the resulting posteriors attain a certain contraction rate. In this spirit, when studying specific priors, some authors now choose to present their results in the form of, say, meta-theorems which claim that sufficient conditions (such as the ones in Ghosal et al. (2000)) required to attain a certain range of contraction rates hold for their choice of prior: cf. de Jonge and van Zanten (2012), Shen and Ghosal (2012), van der Vaart and van Zanten (2008) and further references therein. We adopt this practice here as well.

In the case where $f_0$ is a function from some functional space of smoothness $\alpha$, the posterior contraction rate is typically compared to the convergence rate of the minimax risk (called optimal rate) over that space in the estimation problem. For example, if we observe a sample of size $n$ and want to estimate a univariate $\alpha$-smooth function (e.g., density or regression function), the typical optimal rate is of order $n^{-\alpha/(2\alpha+1)}$, possibly up to a logarithmic factor depending on the risk function. If the smoothness parameter $\alpha$ is unknown, and one wants to build estimators which attain the optimal rate corresponding to $\alpha$ but do not depend explicitly on $\alpha$, one speaks of an adaptation problem. In a Bayesian context, the adaptation problem consists in finding a prior which leads to the optimal posterior contraction rate (usually up to a logarithmic factor) for any $\alpha$-smooth function of interest and does not depend on the smoothness parameter $\alpha$. Such priors have been studied in different settings: cf. de Jonge and van Zanten (2012), Shen and Ghosal (2012), van der Vaart and van Zanten (2008), van der Vaart and van Zanten (2009) and Belitser and Ghosal (2003) among others.
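To give a feeling for the orders of magnitude involved, the following minimal sketch evaluates the benchmark rate $n^{-\alpha/(2\alpha+1)}$ for a few smoothness levels. The function name `minimax_rate` is ours, chosen only for illustration, and logarithmic factors are ignored.

```python
def minimax_rate(n: int, alpha: float) -> float:
    """Return n^(-alpha / (2 alpha + 1)), the typical optimal rate without log factors."""
    if n < 1 or alpha <= 0:
        raise ValueError("need n >= 1 and alpha > 0")
    return n ** (-alpha / (2.0 * alpha + 1.0))


if __name__ == "__main__":
    # Smoother functions (larger alpha) can be estimated at a faster rate.
    for alpha in (0.5, 1.0, 2.0):
        rates = [minimax_rate(n, alpha) for n in (100, 10_000)]
        print(f"alpha={alpha}: {rates}")
```

The rate approaches the parametric order $n^{-1/2}$ as $\alpha$ grows, which is why a prior that does not know $\alpha$ but still attains this rate is called adaptive.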



Splines, in particular, can be used when constructing adaptive priors. A spline (cf. de Boor (1978)) is a piecewise polynomial function designed to have a certain level of smoothness which is referred to as its order. Splines are easy to store, differentiate, integrate and evaluate on a computer, and are extensively used in practice for constructing good, parsimonious approximations of smooth functions. The points at which the different polynomial pieces of a spline connect are called knots. If an order (which fixes the maximal polynomial degree) and a set of knots are fixed, then the space of all splines with that order and those knots forms a linear space which admits a basis of so called B-splines. Any spline of a fixed order is consequently characterized by a set of knots and its coordinates in the B-spline basis corresponding to those knots. Randomly generating a number of knots and, given those, generating random coordinates in the corresponding B-spline basis with equally spaced knots results in a random spline whose law can be used as a prior. If, given the number of knots, the coordinates in the corresponding B-spline basis are chosen to be independent and normally distributed, then the resulting spline has a conditionally Gaussian law and was studied by de Jonge and van Zanten (2012) by using Reproducing Kernel Hilbert Space techniques. Shen and Ghosal (2012) propose a more general, random series prior: the coefficients in the series are not necessarily independent or Gaussian and a basis other than the B-spline basis can also be used.

The case where the locations of the knots are also random is not covered by the results of either de Jonge and van Zanten (2012) or Shen and Ghosal (2012). However, when practitioners put a prior on the number of knots they almost invariably also put a prior on the locations of the knots (e.g., Denison et al. (1998), DiMatteo et al. (2001), Sharef et al. (2010)); a Poisson process is a popular choice. Their motivation for allowing arbitrarily located knots seems to be twofold. Firstly, this is attractive from the implementation point of view: designing reversible jump MCMC samplers is much simpler if any collection of knots is allowed, since new knots can be inserted at arbitrary positions causing only localized changes in the spline. Secondly, the resulting posterior based on the prior with random locations of the knots is expected to be more "spatially adaptive": the function of interest may not have a fixed level of smoothness throughout its support; it may consist of rough and smooth pieces. To sustain an adequate level of accuracy over the whole support, more knots are needed in rough pieces and fewer in smooth ones. Therefore, to make it at least possible for the resulting posterior to pick up eventual spatial features of the function, the prior has to be flexible enough to model random locations of the knots.

In this paper, we extend the results of de Jonge and van Zanten (2012) and those of Shen and Ghosal (2012) with respect to the prior with random knots: we add one more level to the hierarchical spline prior by putting a prior on the location of the knots of the spline as well, making, in fact, the basis functions also random. Under some mild assumptions on the proposed hierarchical spline prior, we establish our main result, which provides sufficient conditions for adaptive, optimal contraction rates of the resulting posterior in a range of models (among others: density estimation, nonparametric regression, binary regression, Poisson regression, and classification). In doing so, we provide a theoretical basis for the common practice of using randomly located knots in spline based priors.



2 Notation and preliminaries on splines 

First we introduce some notation. For $d \in \mathbb{N}$ and $1 \le p < \infty$ denote by $\|x\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$ the $\ell_p$-norm of $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and by $\|x\|_\infty = \max_{i=1,\dots,d} |x_i|$. For $1 \le p < \infty$ let the $L_p$-norm of a function $f$ on $[0,1]$ be $\|f\|_p = \big(\int_0^1 |f(x)|^p\,dx\big)^{1/p}$ and $\|f\|_\infty = \sup_{x \in [0,1]} |f(x)|$.

We use $\lesssim$ (respectively $\gtrsim$) to denote smaller (respectively greater) or equal up to a constant; the symbols $a \vee b$ and $a \wedge b$ stand for $\max\{a, b\}$ and $\min\{a, b\}$ respectively. The covering number $N(\varepsilon, S, d)$ of a subset $S$ of a metric space with balls of size $\varepsilon$ is the smallest number of balls (with respect to distance $d$) of radius $\varepsilon$ needed to cover $S$.

Now we provide some preliminaries on splines, which can be found, for example, in Schumaker (2007). A function is called a spline of order $q \in \mathbb{N}$, with respect to a certain partition of its support, if it is $q - 2$ times continuously differentiable and, when restricted to each interval in this partition, coincides with a polynomial of degree at most $q - 1$. Consider $q \in \mathbb{N}$, $q \ge 2$, which will be fixed throughout the remainder of this text. For any $j \in \mathbb{N}$ such that $j \ge q$ let $\mathcal{K}_j = \{(k_1, \dots, k_{j-q}) \in (0,1)^{j-q} : 0 < k_1 < \dots < k_{j-q} < 1\}$. We will refer to a vector $k = k_j \in \mathcal{K}_j$ as a set of inner knots; the index $j$ in $k_j$ will sometimes be used to emphasize the dependence on $j$. A vector $k \in \mathcal{K}_j$ will be said to induce the partition $\{[k_0, k_1), [k_1, k_2), \dots, [k_{j-q}, k_{j-q+1}]\}$, with $k_0 = 0$ and $k_{j-q+1} = 1$. For any $k \in \mathcal{K}_j$ we will call $M(k) = \max_{1 \le i \le j-q+1} |k_i - k_{i-1}|$ the mesh size of the partition induced by $k$ and $m(k) = \min_{1 \le i \le j-q+1} |k_i - k_{i-1}|$ the sparseness of the partition induced by $k$.

For a $k \in \mathcal{K}_j$, denote by $\mathcal{S}^k = \mathcal{S}^k_q$ the linear space of splines of order $q$ on $[0,1]$ with simple knots $k$ (see the definition of knot multiplicity in Schumaker (2007)). This space has dimension $j$ and admits a basis of so called B-splines $\{B^k_1, \dots, B^k_j\}$. The construction of $\{B^k_1, \dots, B^k_j\}$ involves the knots $k_{-q+1}, \dots, k_{-1}, k_0, k_1, \dots, k_{j-q}, k_{j-q+1}, k_{j-q+2}, \dots, k_j$, with arbitrary extra knots $k_{-q+1} \le \dots \le k_{-1} \le k_0 = 0$ and $1 = k_{j-q+1} \le k_{j-q+2} \le \dots \le k_j$. Usually one takes $k_{-q+1} = \dots = k_{-1} = k_0 = 0$ and $1 = k_{j-q+1} = \dots = k_j$, and we adopt this choice here as well. These basis functions are nonnegative: $B^k_i(x) \ge 0$ for all $x \in [0,1]$. Besides, they have local support and form a partition of unity:

$$B^k_i(x) = 0 \ \text{ for } x \notin [k_{i-q}, k_i], \qquad \sum_{i=1}^{j} B^k_i(x) = 1 \ \text{ for all } x \in [0,1]. \tag{1}$$



To refer explicitly to the coordinates $a = (a_1, \dots, a_j) \in \mathbb{R}^j$ of a spline on a specific B-spline basis with inner knots $k$, we write $s_{a,k}(x) = \sum_{i=1}^{j} a_i B^k_i(x)$, $x \in [0,1]$. Since $\sum_{i=1}^{j} B^k_i(x) = 1$, it is easy to see that for any $s_{a,k}, s_{b,k} \in \mathcal{S}^k_q$

$$\|s_{a,k} - s_{b,k}\|_2 \le \|s_{a,k} - s_{b,k}\|_\infty \le \|a - b\|_\infty \le \|a - b\|_2. \tag{2}$$
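The properties (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original development: it evaluates the B-spline basis with the standard Cox-de Boor recursion on the extended knot vector with $q$-fold boundary knots at 0 and 1, as adopted above. The function names `bspline_basis` and `spline` are ours and purely illustrative.

```python
def bspline_basis(x, inner_knots, q):
    """Values (B_1(x), ..., B_j(x)) of the order-q B-splines on [0, 1].

    The extended knot vector repeats 0 and 1 exactly q times, so that
    j = len(inner_knots) + q basis functions are returned.
    """
    t = [0.0] * q + list(inner_knots) + [1.0] * q
    j = len(inner_knots) + q
    # Order 1: indicators of the half-open intervals [t_i, t_{i+1}).
    b = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if x == 1.0:
        # Close the last non-degenerate interval at the right end point.
        b[j - 1] = 1.0
    # Cox-de Boor recursion, raising the order one step at a time.
    for order in range(2, q + 1):
        nxt = []
        for i in range(len(t) - order):
            left_den = t[i + order - 1] - t[i]
            right_den = t[i + order] - t[i + 1]
            left = (x - t[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = (t[i + order] - x) / right_den * b[i + 1] if right_den > 0 else 0.0
            nxt.append(left + right)
        b = nxt
    return b


def spline(x, coeffs, inner_knots, q):
    """Evaluate s_{a,k}(x) = sum_i a_i B_i^k(x)."""
    return sum(a * bi for a, bi in zip(coeffs, bspline_basis(x, inner_knots, q)))


if __name__ == "__main__":
    knots, q = [0.2, 0.5, 0.7], 3
    for x in (0.0, 0.33, 0.5, 1.0):
        values = bspline_basis(x, knots, q)
        print(x, sum(values))  # the basis sums to one at every x in [0, 1]
```

Because the basis functions are nonnegative and sum to one, any spline value is a convex combination of its coefficients, which is exactly what yields the middle inequality in (2).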



Splines have good approximation properties for sufficiently smooth functions provided they are defined on a partition with appropriately small mesh size. We say that a function $f$ on $[0,1]$ belongs to a generic smoothness class $\mathcal{F}_\alpha$, $\alpha > 0$, if for any set of inner knots $k$ there exists a spline $s_{a,k} \in \mathcal{S}^k_q$ such that for some bounded $C_f$

$$\|f - s_{a,k}\|_\infty \le C_f M^\alpha(k). \tag{3}$$



We will also be assuming that $\mathcal{F}_\alpha$ is contained in a Lipschitz class: $\mathcal{F}_\alpha \subseteq \mathcal{L}(\kappa_\alpha, L_\alpha) = \{f : |f(x_1) - f(x_2)| \le L_\alpha |x_1 - x_2|^{\kappa_\alpha},\ x_1, x_2 \in [0,1]\}$ for some $\kappa_\alpha, L_\alpha > 0$.

A leading example of a smoothness class $\mathcal{F}_\alpha$ is the Hölder space $\mathcal{H}_\alpha = \mathcal{H}_\alpha(L, [0,1])$, $0 < \alpha \le q$, which is the collection of all functions $f$ that have bounded derivatives up to order $\alpha_0 = \lfloor \alpha \rfloor = \max\{z \in \mathbb{Z} : z < \alpha\}$ and such that the $\alpha_0$-th derivative satisfies the Hölder condition $|f^{(\alpha_0)}(x) - f^{(\alpha_0)}(y)| \le L |x - y|^{\alpha - \alpha_0}$, for $L > 0$ and $x, y \in [0,1]$. In this case, a well-known spline approximation result (cf. de Boor (1978)) claims that (3) holds with $C_f = C_q \|f^{(\alpha)}\|_\infty$ for some constant $C_q$ depending only on $q$. Other examples of smoothness classes for which the approximation property (3) holds include $\alpha$-times continuously differentiable functions, Sobolev and Besov spaces; cf. Theorems 6.21, 6.25 and 6.31 in Schumaker (2007).



3 Main Result 

We begin by describing a hierarchical prior on $\mathcal{S} = \mathcal{S}_q = \bigcup_{j=q}^{\infty} \bigcup_{k \in \mathcal{K}_j} \mathcal{S}^k_q$: first draw a number $J \in \mathbb{N}$, $J \ge q$; then, given $J$, generate independently $(J - q)$ inner knots $K_J \in \mathcal{K}_J$ and also independently, $J$ B-spline coefficients $\theta \in \mathbb{R}^J$. Our prior on $\mathcal{S}$ will be the law of the random spline $s_{\theta, K_J}$. We impose the following conditions on this prior. For $c_1, c_2 > 0$, $0 \le t_1, t_2 \le 1$ and all sufficiently large $j$,

$$P(J > j) \lesssim \exp\big(-c_1\, j \log^{t_1} j\big), \tag{4}$$
$$P(J = j) \gtrsim \exp\big(-c_2\, j \log^{t_2} j\big). \tag{5}$$

For some $r \ge 1$, $c_3 > 0$, $0 \le t_3 \le 1$, and all $j \ge q$,

$$P\big(m(K_j) < \delta(j) \mid J = j\big) = 0, \tag{6}$$
$$P\big(M(K_j) \le r/j \mid J = j\big) \gtrsim \exp\big(-c_3\, j \log^{t_3} j\big), \tag{7}$$

where $\delta(\cdot)$ is a positive, strictly decreasing function on $\mathbb{N}$. Without loss of generality assume that $\delta(i) \le 1$, $i \in \mathbb{N}$. For each $j \ge q$, the conditional distribution of $\theta \in \mathbb{R}^j$ satisfies the following condition: for any $M > 0$ there exists $c_0 = c_0(M)$ such that

$$P\big(\|\theta - \theta_0\|_\infty \le \varepsilon \mid J = j\big) \ge \exp\big(-c_0\, j \log(1/\varepsilon)\big) \tag{8}$$

for all $\varepsilon > 0$ and all $\theta_0 \in \mathbb{R}^j$ such that $\|\theta_0\|_\infty \le M$.
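As a small illustration of the small-ball condition (8), consider the hypothetical case where the coordinates of $\theta$ are i.i.d. uniform on $[-M, M]$ (our assumption for this sketch, not a choice prescribed by the text). For $\varepsilon \le 2M$ each coordinate falls within $\varepsilon$ of $\theta_{0,i}$ with probability at least $\varepsilon/(2M)$, so the joint probability is at least $\exp(-j \log(2M/\varepsilon))$. The function names below are ours.

```python
import math


def uniform_small_ball_prob(theta0, eps, M):
    """P(||theta - theta0||_inf <= eps) for i.i.d. Uniform[-M, M] coordinates."""
    prob = 1.0
    for t in theta0:
        lo = max(-M, t - eps)
        hi = min(M, t + eps)
        prob *= max(hi - lo, 0.0) / (2.0 * M)
    return prob


def small_ball_lower_bound(j, eps, M):
    """The bound (eps / (2M))^j = exp(-j log(2M / eps)), valid for eps <= 2M."""
    return math.exp(-j * math.log(2.0 * M / eps))


if __name__ == "__main__":
    theta0 = [1.0, -1.0, 0.3]  # worst case: two coordinates on the boundary
    print(uniform_small_ball_prob(theta0, 0.1, 1.0))
    print(small_ball_lower_bound(len(theta0), 0.1, 1.0))
```

For small $\varepsilon$ the quantity $\log(2M/\varepsilon)$ is bounded by a multiple of $\log(1/\varepsilon)$, which is the form required in (8).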

For examples of particular choices of the components of our hierarchical prior which verify these conditions we refer the reader to Section 5.

Denote $\mathcal{C}^j(M) = [-M, M]^j$. The following theorem is our main result.

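Theorem 1. Let $\|f_0\|_\infty \le M$ and $f_0 \in \mathcal{F}_\alpha$ so that (3) holds with $C_{f_0}$. Let $\varepsilon_n, \bar\varepsilon_n$ be two positive sequences such that $\varepsilon_n \ge \bar\varepsilon_n$, $\varepsilon_n \to 0$ as $n \to \infty$ and $n \bar\varepsilon_n^2 > 1$. Assume that there exist sequences $J_n, \bar J_n \ge q$, $M_n > 0$ and a constant $c_M > c_1$ satisfying:

$$J_n \log \frac{J_n (M_n \vee 1)}{\varepsilon_n\, \delta(J_n)} \lesssim n \varepsilon_n^2, \tag{9}$$
$$\frac{n \bar\varepsilon_n^2}{\log^{t_1} J_n} \le J_n, \qquad P\big(\theta \notin \mathcal{C}^j(M_n) \mid J = j\big) \lesssim \exp\big(-c_M\, n \bar\varepsilon_n^2\big), \quad q \le j \le J_n, \tag{10}$$
$$\Big(\frac{r^\alpha C_{f_0}}{\bar\varepsilon_n}\Big)^{1/\alpha} \le \bar J_n, \qquad \log^{t_2 \vee t_3} \bar J_n \le \log \frac{1}{\bar\varepsilon_n}. \tag{11}$$

Let $\mathcal{S}_n = \bigcup_{j=q}^{J_n} \bigcup_{k \in \mathcal{K}_j^{\delta(j)}} \{s_{\theta,k} \in \mathcal{S}^k_q : \|\theta\|_\infty \le M_n\}$, where $\mathcal{K}_j^{\delta} = \{k \in \mathcal{K}_j : m(k) \ge \delta\}$. Then it holds that

$$\log N(\varepsilon_n, \mathcal{S}_n, \|\cdot\|_2) \lesssim n \varepsilon_n^2, \tag{12}$$
$$P\big(s_{\theta, K_J} \notin \mathcal{S}_n\big) \lesssim \exp\big(-c_1\, n \bar\varepsilon_n^2\big), \tag{13}$$
$$P\big(\|s_{\theta, K_J} - f_0\|_\infty \le 2\bar\varepsilon_n\big) \gtrsim \exp\big\{-\big(c_0(M) + c_2 + c_3\big) \bar J_n \log(1/\bar\varepsilon_n)\big\}. \tag{14}$$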

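Remark 1. Consider constants $c_4, c_5 > 0$ and a function $\delta(\cdot)$ as above. If condition (6) is replaced by

$$\sum_{j \ge q} P(J = j)\, P\big(m(K_j) < \delta(j) \mid J = j\big) \le c_5 \exp(-c_4 n), \tag{6'}$$

then the conclusions of Theorem 1 remain valid so long as $J_n$ is a sequence satisfying (9) and (10) (cf. Section 5 and Remark 4 for a comparison of (6) and (6')).

Proof. First we establish (12). Let $L_n(j) = 4 M_n\, j\, (q+1)\, \delta(j)^{-(q+1)}$ and $j \ge q$. Let $\{\theta_1, \dots, \theta_{m_1}\}$ be an $\varepsilon_n/2$-net of the set $\{\theta \in \mathbb{R}^j : \|\theta\|_\infty \le M_n\}$ and let $\{x_1, \dots, x_{m_2}\}$ be an $\varepsilon_n/(2L_n(j))$-net of $\{x \in \mathbb{R}^{j-q} : x \in (0,1)^{j-q}\}$, both with respect to the $\|\cdot\|_\infty$-norm. Then, by using (2) and Lemma 2 (Lemma 2 is applicable since $\varepsilon_n/(2L_n(j)) \le \delta(j)$ for sufficiently large $n$), $\{s_{\theta_k, x_l} : k = 1, \dots, m_1,\ l = 1, \dots, m_2\}$ forms an $\varepsilon_n$-net of $\bigcup_{k \in \mathcal{K}_j^{\delta(j)}} \{s_{\theta,k} \in \mathcal{S}^k_q : \|\theta\|_\infty \le M_n\}$ with respect to the $\|\cdot\|_\infty$-norm. By using this fact, we obtain

$$N(\varepsilon_n, \mathcal{S}_n, \|\cdot\|_2) \le N(\varepsilon_n, \mathcal{S}_n, \|\cdot\|_\infty) \le \sum_{j=q}^{J_n} N\Big(\frac{\varepsilon_n}{2}, \{\theta \in \mathbb{R}^j : \|\theta\|_\infty \le M_n\}, \|\cdot\|_\infty\Big)\, N\Big(\frac{\varepsilon_n}{2 L_n(j)}, (0,1)^{j-q}, \|\cdot\|_\infty\Big).$$

The last relation and (9) imply (12):

$$\log N(\varepsilon_n, \mathcal{S}_n, \|\cdot\|_2) \lesssim J_n \log \frac{16 (q+1) (M_n \vee 1)^2 J_n}{\varepsilon_n^2\, \delta(J_n)^{q+1}} \lesssim J_n \log \frac{J_n (M_n \vee 1)}{\varepsilon_n\, \delta(J_n)} \lesssim n \varepsilon_n^2.$$

Now we check (13). From the definition of $\mathcal{S}_n$, the relations (4), (6) and (10), it follows that

$$P\big(s_{\theta, K_J} \notin \mathcal{S}_n\big) \le P(J > J_n) + \sum_{j=q}^{J_n} P(J = j)\, P\big(m(K_j) < \delta(j) \mid J = j\big) + \sum_{j=q}^{J_n} P(J = j)\, P\big(\theta \notin \mathcal{C}^j(M_n) \mid J = j\big)$$
$$\lesssim \exp\big\{-c_1 J_n \log^{t_1} J_n\big\} + 0 + \exp\big\{-c_M\, n \bar\varepsilon_n^2\big\} \lesssim \exp\big\{-c_1\, n \bar\varepsilon_n^2\big\}.$$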

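It remains to prove (14). First note that, by using (3) and (11), for all $j \ge \bar J_n$ and for all sets of knots $k_j \in \mathcal{K}_j$ such that $M(k_j) \le r/j$, there exists a spline $s_{\theta_0, k_j} \in \mathcal{S}^{k_j}_q$ (of course, $\theta_0 = \theta_0(k_j) = \theta_0(k_j, f_0)$) such that

$$\|f_0 - s_{\theta_0, k_j}\|_\infty \le C_{f_0} M^\alpha(k_j) \le C_{f_0}\, r^\alpha\, \bar J_n^{-\alpha} \le \bar\varepsilon_n. \tag{15}$$

Since $\|f_0\|_\infty \le M$ and $\bar J_n$ must grow with $n$ in view of (11), it follows from Lemma 3 and (15) that $\|\theta_0(k_j)\|_\infty \le M$ for all $k_j \in \mathcal{K}_j$ such that $M(k_j) \le r/\bar J_n$ for $j \ge \bar J_n$.

Introduce the events: $E_1^j = \{M(K_j) \le r/j\}$, $E_2^j = \{\|f_0 - s_{\theta_0(K_j), K_j}\|_\infty \le \bar\varepsilon_n\}$, $E_3^j = \{\|\theta_0(K_j) - \theta\|_\infty \le \bar\varepsilon_n\}$, $E_4^j = \{\|f_0 - s_{\theta, K_j}\|_\infty \le 2\bar\varepsilon_n\}$ and $E_5^j = \{\|\theta_0(K_j)\|_\infty \le M\}$.

Using the argument from the previous paragraph, the triangle inequality, (2) and (15), we obtain that

$$E_1^{\bar J_n} \subseteq E_2^{\bar J_n} \cap E_5^{\bar J_n}, \qquad E_2^j \cap E_3^j \subseteq E_4^j, \quad j \ge q. \tag{16}$$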

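Combining (5), (7), (8), (11) and (16), we prove (14):

$$
\begin{aligned}
P\big(\|s_{\theta, K_J} - f_0\|_\infty \le 2\bar\varepsilon_n\big) = P(E_4^J) &\ge P(J = \bar J_n)\, P\big(E_4^{\bar J_n} \mid J = \bar J_n\big) \\
&\ge P(J = \bar J_n)\, P\big(E_2^{\bar J_n} \cap E_3^{\bar J_n} \mid J = \bar J_n\big) \\
&\ge P(J = \bar J_n)\, P\big(E_1^{\bar J_n} \cap E_3^{\bar J_n} \cap E_5^{\bar J_n} \mid J = \bar J_n\big) \\
&= P(J = \bar J_n)\, E\Big[P\big(E_1^{\bar J_n} \cap E_3^{\bar J_n} \cap E_5^{\bar J_n} \mid J = \bar J_n, K_{\bar J_n}\big)\Big] \\
&= P(J = \bar J_n)\, E\Big[1\big\{K_{\bar J_n} \in E_1^{\bar J_n} \cap E_5^{\bar J_n}\big\}\, P\big(E_3^{\bar J_n} \mid J = \bar J_n, K_{\bar J_n}\big)\Big] \\
&\ge P(J = \bar J_n)\, P\big(E_1^{\bar J_n} \mid J = \bar J_n\big) \inf_{\|\theta_0\|_\infty \le M} P\big(\|\theta - \theta_0\|_\infty \le \bar\varepsilon_n \mid J = \bar J_n\big) \\
&\gtrsim \exp\big(-(c_2 + c_3)\, \bar J_n \log^{t_2 \vee t_3} \bar J_n\big)\, \exp\big(-c_0(M)\, \bar J_n \log(1/\bar\varepsilon_n)\big) \\
&\gtrsim \exp\big(-(c_0(M) + c_2 + c_3)\, \bar J_n \log(1/\bar\varepsilon_n)\big).
\end{aligned}
$$

□

Remark 2. If the range of the underlying curve $f_0$ is contained in some known interval $[a, b] \subset \mathbb{R}$, then, according to Lemma 3 and the proof of property (14), the prior on $\theta \in \mathbb{R}^J$ can be chosen to be supported on, say, $[a - 1, b + 1]^J$ so that (8) has to hold only for $\theta_0 \in [a - 1, b + 1]^J$. Condition (10) will trivially be satisfied for $M_n \ge (1 - a) \vee (b + 1)$.

Remark 3. If (20) is assumed instead of (7), the proof of (14) can then be simplified a lot, as in this case one can condition on the event $\{K_{\bar J_n} = \bar k_{\bar J_n}\}$ so that $\theta_0 = \theta_0(\bar k_{\bar J_n})$ becomes fixed and $P\big(E_1^{\bar J_n} \mid J = \bar J_n, K_{\bar J_n} = \bar k_{\bar J_n}\big) = 1$.

Remark 4. Condition (6) is used in the proof of Theorem 1 exclusively to enforce $\sum_{j \ge q} P(J = j)\, P(m(K_j) < \delta(j) \mid J = j)$ to be zero. Inspection of the proof shows, however, that it would suffice to require this sum to be upper-bounded by a multiple of $\exp\{-c_1 n \bar\varepsilon_n^2\}$. Although this would be a weaker requirement, typically the sequence $\bar\varepsilon_n$ will depend on the unknown smoothness $\alpha$. Note however that since $\varepsilon_n \ge \bar\varepsilon_n$ and $\varepsilon_n$ will obviously be taken to converge to 0, then for large enough $n$, $c_1 n \bar\varepsilon_n^2 \le n$. This allows the term $\sum_{j \ge q} P(J = j)\, P(m(K_j) < \delta(j) \mid J = j)$ to be absorbed into the remaining terms of the bound on $P(s_{\theta, K_J} \notin \mathcal{S}_n)$ in the proof. Consequently, as claimed, Theorem 1 also holds if (6') is assumed instead of (6).

4 Implications of the main result

We clarify now the relevance of our result. Consider a family of models $\mathcal{P} = \{P_f : f \in \mathcal{F}_A\}$, $\mathcal{F}_A = \bigcup_{\alpha \in A} \mathcal{F}_\alpha$, with densities $p_f$ with respect to some common dominating measure. Assume that we observe a sample $X^{(n)} = (X_1, \dots, X_n) \sim p_{f_0}^{(n)}$, $X_i \sim p_{f_0}$, $f_0 \in \mathcal{F}_\alpha$ for some unknown smoothness $\alpha \in A$. The Bayesian approach consists of putting a prior measure $\Pi$ on $\mathcal{F} \subseteq \mathcal{F}_A$ which, together with the likelihood $p_f^{(n)}$, leads to the posterior distribution $\Pi(\cdot \mid X^{(n)})$ via Bayes' formula:

$$\Pi\big(A \mid X^{(n)}\big) = \frac{\int_A p_f^{(n)}(X^{(n)})\, d\Pi(f)}{\int_{\mathcal{F}} p_f^{(n)}(X^{(n)})\, d\Pi(f)}$$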



for a measurable $A \subseteq \mathcal{F}$. The asymptotic behavior of the posterior distribution can be studied from the point of view of the probability measure $P_0 = P_{f_0}$; see Ghosal et al. (2000).

For two densities $p_f$ and $p_g$ with $f, g \in \mathcal{F}_A$, define the (squared) Hellinger metric $h^2(p_f, p_g) = 2\big(1 - E_g \sqrt{p_f(X)/p_g(X)}\big)$, the Kullback-Leibler divergence $K(p_f, p_g) = -E_g \log\big(p_f(X)/p_g(X)\big)$ and the Csiszár $f$-divergence $V(p_f, p_g) = E_g \log^2\big(p_f(X)/p_g(X)\big)$. Define also the ball $B(\varepsilon, f_0) = \{f \in \mathcal{F} : K(p_f, p_{f_0}) \le \varepsilon^2,\ V(p_f, p_{f_0}) \le \varepsilon^2\}$.
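For concreteness, the three discrepancies just defined can be computed for densities with respect to counting measure on a finite set. The sketch below is only an illustration under that assumption; the function names are ours, and both probability vectors are assumed to have strictly positive entries.

```python
import math


def hellinger_sq(pf, pg):
    """h^2(p_f, p_g) = 2 (1 - E_g sqrt(p_f / p_g)) = 2 (1 - sum sqrt(p_f p_g))."""
    return 2.0 * (1.0 - sum(math.sqrt(a * b) for a, b in zip(pf, pg)))


def kullback_leibler(pf, pg):
    """K(p_f, p_g) = -E_g log(p_f / p_g), with the expectation taken under p_g."""
    return -sum(b * math.log(a / b) for a, b in zip(pf, pg))


def csiszar_v(pf, pg):
    """V(p_f, p_g) = E_g log^2(p_f / p_g), with the expectation taken under p_g."""
    return sum(b * math.log(a / b) ** 2 for a, b in zip(pf, pg))


if __name__ == "__main__":
    pf, pg = [0.5, 0.5], [0.9, 0.1]  # two Bernoulli laws written as vectors
    print(hellinger_sq(pf, pg), kullback_leibler(pf, pg), csiszar_v(pf, pg))
```

In this example the squared Hellinger distance is smaller than the Kullback-Leibler divergence, in line with the general inequality $h^2 \le K$.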

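The following theorem is the main result of Ghosal et al. (2000) (for a version involving two sequences $\varepsilon_n$ and $\bar\varepsilon_n$ cf. also Ghosal and van der Vaart (2001)), which makes a statement about the asymptotic behavior of a posterior measure.

Theorem 2 (Theorem 2.1 of Ghosal et al. (2000)). Suppose that for two positive sequences $\varepsilon_n \ge \bar\varepsilon_n$ such that $n \bar\varepsilon_n^2 > 1$ and $\varepsilon_n \to 0$ as $n \to \infty$, sets $\mathcal{F}_n \subseteq \mathcal{F}$ and constants $c_1, c_2, c_3, c_4 > 0$, the following conditions hold:

$$\log N(\varepsilon_n, \mathcal{F}_n, h) \le c_1\, n \varepsilon_n^2, \tag{17}$$
$$\Pi(\mathcal{F} \setminus \mathcal{F}_n) \le c_2\, e^{-(c_3 + 4)\, n \bar\varepsilon_n^2}, \tag{18}$$
$$\Pi\big(B(\bar\varepsilon_n, f_0)\big) \ge c_4\, e^{-c_3\, n \bar\varepsilon_n^2}. \tag{19}$$

Then, for large enough $M > 0$, $\Pi\big(f \in \mathcal{F} : h(p_f, p_{f_0}) \ge M \varepsilon_n \mid X^{(n)}\big) \to 0$ as $n \to \infty$ in $P_{f_0}$-probability.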



The conditions of this theorem require the existence of a sieve $\mathcal{F}_n$ with small entropy (17) which contains most of the prior mass (18), and enough prior mass around the parameter $f_0$ which indexes the "true" underlying measure of the data (19). Assume now that the models in $\mathcal{P}$ are such that for $d^2$ being $h^2$, $K$ or $V$, $d^2(p_f, p_{f_0}) \lesssim \|f - f_0\|_2^2$. If in addition one can prove that in the considered model $h(p_f, p_{f_0}) \gtrsim \|f - f_0\|_2$, then Theorem 2 delivers a contraction rate $\varepsilon_n$ with respect to the $L_2$-distance as well. Some examples of models for which the above relations between norms can be established are, among others, density estimation, nonparametric regression, binary regression, Poisson regression and classification; cf. Ghosal et al. (2000), de Jonge and van Zanten (2012), Shen and Ghosal (2012). In this case one can apply our meta-theorem (Theorem 1), which essentially verifies (17)-(19) for our spline-based prior, to obtain an adaptive contraction rate. We summarize this in the following theorem.

Theorem 3. Let $\Pi$ be the spline prior described in Section 3. Consider a family of models $\mathcal{P} = \{P_f : f \in \mathcal{F}_A\}$, $\mathcal{F}_A = \bigcup_{\alpha \in A} \mathcal{F}_\alpha$, with densities $p_f$ with respect to some common dominating measure. Assume also that the models in $\mathcal{P}$ are such that for $d^2$ being $h^2$, $K$ or $V$, $d^2(p_f, p_{f_0}) \lesssim \|f - f_0\|_2^2$. Take an i.i.d. sample $X^{(n)} = (X_1, \dots, X_n)$, $X_i \sim P_{f_0}$, $f_0 \in \mathcal{F}_\alpha$, $\|f_0\|_\infty \le M$, for some unknown smoothness $\alpha \in A$. Consider a prior $\Pi$ which verifies (4) through (8) for certain constants $c_1, c_2, c_3, t_1, t_2$ and $t_3$. Assume also that either $\alpha \ge 1$ or $t_2 \vee t_3 < 1$.

Then, for large enough $C > 0$, $\Pi\big(f \in \mathcal{F} : h(p_f, p_{f_0}) \ge C \varepsilon_n \mid X^{(n)}\big) \to 0$ as $n \to \infty$ in $P_{f_0}$-probability for $\varepsilon_n = C_3\, n^{-\alpha/(2\alpha+1)} (\log n)^{\alpha/(2\alpha+1) + (1 - (t_1 \wedge t_3))/2}$. If $h(p_f, p_{f_0}) \gtrsim \|f - f_0\|_2$, then in the previous statement the Hellinger distance may be replaced by the $L_2$ distance and the statement remains valid.

Proof. We have that for some constant $k > 0$ and $\mathcal{F} = \mathcal{S}$, $\mathcal{F}_n = \mathcal{S}_n$,

$$N(\varepsilon_n, \mathcal{F}_n, h) \le N(\varepsilon_n/k, \mathcal{F}_n, \|\cdot\|_2),$$
$$\Pi(\mathcal{F} \setminus \mathcal{F}_n) = P\big(s_{\theta, K_J} \notin \mathcal{F}_n\big),$$
$$\Pi\big(B(\bar\varepsilon_n, f_0)\big) \ge P\big(\|s_{\theta, K_J} - f_0\|_\infty \le \bar\varepsilon_n/k\big).$$

The first inequality follows from the fact that by assumption $h(p_f, p_g) \le k \|f - g\|_2$ and so an $\varepsilon/k$ cover of $\mathcal{F}_n$ according to $\|\cdot\|_2$ induces an $\varepsilon$ cover of $\mathcal{F}_n$ according to $h$. Then, since for $d^2$ being $K$ or $V$, $d^2(p_f, p_{f_0}) \le k \|f - f_0\|_2^2$, we have $B(\varepsilon, f_0) \supseteq \{f \in \mathcal{F} : \|f - f_0\|_\infty \le \varepsilon/k\}$ and the last inequality follows.

By assumption $f_0 \in \mathcal{F}_\alpha$ satisfies the conditions of Theorem 1; assume (3) holds for some $C_{f_0}$. Consider then a prior that satisfies (4)-(8). Let us present a choice of quantities $M_n$, $\delta(j)$, $J_n$, $\bar J_n$, $\varepsilon_n$ and $\bar\varepsilon_n$ which meet conditions (9)-(11). First of all, the sequence $M_n$ can be taken as a polynomial in $n$ (for instance, for normal or exponential conditional priors for $\theta \in \mathbb{R}^j$ in (10)) and $1/\delta(j)$ as a polynomial in $j$. Next, note that there is no $\bar J_n$ that satisfies (11) if $\alpha < 1$ and $t_2 \vee t_3 = 1$. If either $\alpha \ge 1$ or $t_2 \vee t_3 < 1$, then the best possible choices are $\bar J_n = r\, C_{f_0}^{1/\alpha} (\bar\varepsilon_n)^{-1/\alpha}$, $\bar\varepsilon_n = C_1 (\log n / n)^{\alpha/(2\alpha+1)}$ for sufficiently large $C_1$, $J_n = C_2\, n^{1/(2\alpha+1)} (\log n)^{2\alpha/(2\alpha+1) - t_1}$ for sufficiently large $C_2$, and finally,

$$\varepsilon_n = C_3\, n^{-\alpha/(2\alpha+1)} (\log n)^{\alpha/(2\alpha+1) + (1 - t_1)/2}$$

for sufficiently large $C_3$. Since these quantities satisfy (9)-(11), Theorem 1 implies conditions (17)-(19) for the quantities defined above. Finally, applying Theorem 2 we conclude that the contraction rate of the resulting posterior is at most $\varepsilon_n$, which appears to be optimal (up to a logarithmic factor) in a minimax sense over the Hölder class $\mathcal{H}_\alpha$ (also over the $\alpha$-smooth Sobolev class).

□
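The choices of $\bar J_n$, $\bar\varepsilon_n$, $J_n$ and $\varepsilon_n$ made in the proof can be tabulated for concrete values of $n$, $\alpha$ and $t_1$. The sketch below simply evaluates those formulas; the function name `theorem3_sequences`, its default constants and the dictionary keys are ours and are chosen only for illustration (in particular, the constants $C_1, C_2, C_3$ are set to 1 rather than "sufficiently large").

```python
import math


def theorem3_sequences(n, alpha, t1, r=1.0, C_f0=1.0, C1=1.0, C2=1.0, C3=1.0):
    """Evaluate the sequences chosen in the proof of Theorem 3 (illustrative constants)."""
    log_n = math.log(n)
    a = alpha / (2.0 * alpha + 1.0)
    eps_bar = C1 * (log_n / n) ** a
    J_bar = r * C_f0 ** (1.0 / alpha) * eps_bar ** (-1.0 / alpha)
    J = C2 * n ** (1.0 / (2.0 * alpha + 1.0)) * log_n ** (2.0 * a - t1)
    eps = C3 * n ** (-a) * log_n ** (a + (1.0 - t1) / 2.0)
    return {"eps_bar": eps_bar, "J_bar": J_bar, "J": J, "eps": eps}


if __name__ == "__main__":
    seq = theorem3_sequences(n=10**6, alpha=1.5, t1=0.5, r=2.0, C_f0=3.0)
    # With this J_bar, the approximation bound in (15) is attained with equality.
    print(seq, 3.0 * 2.0**1.5 * seq["J_bar"] ** (-1.5))
```

With $t_1 = 1$ the two rates $\varepsilon_n$ and $\bar\varepsilon_n$ coincide up to constants; for $t_1 < 1$ the rate $\varepsilon_n$ carries the extra factor $(\log n)^{(1 - t_1)/2}$.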

Remark 5. A priori, it may be unknown whether $\alpha \ge 1$ or not, or it may be simply known that $\alpha < 1$. We can however always ensure the condition $t_2 \vee t_3 < 1$ by an appropriate choice of prior. For example, we take a geometric prior on $J$ so that $t_2 = 0$ and a prior on $K_j$ such that (20) (which implies (7)) holds with, say, $t_3 = 0$.

Remark 6. The common practice, in applications, of endowing the location of the knots with a Poisson point process prior results in a prior that does not verify assumption (6). Assumption (6'), however, permits this so long as a large enough point mass is placed at an equally spaced knot vector. This very simple modification assures that our Theorem 3 may be applied to show that these priors result in rate adaptive posteriors.

5 Examples of Priors

We give now examples of particular choices for the several components of our hierarchical prior which verify conditions (4) through (8) and (6').

As for the prior on the number of basis functions, assumptions (4) and (5) hold for the geometric, Poisson and negative binomial distributions; cf. Shen and Ghosal (2012). Assumption (8), on the other hand, will trivially hold if we assume, for example, the coordinates of $\theta \in \mathbb{R}^j$ to be (conditionally on $J = j$) independent and identically distributed according to a density uniformly bounded away from zero on the interval $[-M, M]$.

There is an ample choice of priors on $K_j$, given $J = j$, which satisfy condition (6). First note that this condition enforces the prior on the location of the knots, for each $J = j$, to be such that, with probability 1, adjacent knots are at least $\delta(j)$ apart. The function $1/\delta(j)$ can be taken as a polynomial in $j$ of high degree, which makes the requirement less restrictive. If a certain sequence $\varepsilon_n$ verifies the conditions of Theorem 1, then an increase in the exponent of $1/\delta(j)$ can be accommodated by making $\varepsilon_n$ larger by a multiplicative factor (cf. condition (9)).

A simple choice for the prior on $K_j$, given $J = j$, is to pick $(j - q)$ knots uniformly at random, without replacement, on a uniform $\delta(j)$-sparse grid. This construction is possible if $\delta$ is chosen in such a way that $\lfloor 1/\delta(j) \rfloor > j - q$ for all $j$. Another example is to take, for each $j$, the $(j - q)$ inner knots in $K_j$ to be generated sequentially in the following way: add a knot $K_1$ uniformly at random on the interval $[\delta(j), 1 - \delta(j)]$, then a knot $K_2$ uniformly at random on the set $[\delta(j), 1 - \delta(j)] \setminus (K_1 - \delta(j), K_1 + \delta(j))$ and so on. Finally, take the ordered $K_j = (K_{(1)}, \dots, K_{(j-q)})$. This construction is always possible if $1/\delta(j)$ grows faster than $2(j - q + 1)$. (If $J$ is Poisson distributed, these points are simply distributed like a homogeneous Poisson process, conditioned to have all points at least $\delta(J)$ apart.) Note that for this construction, the probability $P(m(K_j) \ge \delta(j) \mid J = j)$ is at least $(1 - 2(j - q)\delta(j))^{j-q}$, which is very close to one if $j$ is large and $1/\delta(j)$ is a large power of $j$, say. Clearly, condition (6) is satisfied for these two constructions since all prior mass is concentrated on partitions with sparseness larger than $\delta(j)$.

It is also easy to see that condition (7) is verified for the knot vectors obtained from one of these two constructions. In fact, condition (7) is trivially fulfilled if, for some $0 \le t_3 \le 1$,

$$P\big(K_j = \bar k_j \mid J = j\big) \ge \exp\big(-c_3\, j \log^{t_3} j\big), \tag{20}$$

where $\bar k_j \in \mathcal{K}_j$ is the set of $(j - q)$ equally spaced inner knots. This suggests a mechanism to assure that any prior which verifies (6) can be slightly modified to also verify (7): given $J = j$, generate a Bernoulli random variable $X$ with success probability, say, $\exp(-c_3\, j \log^{t_3} j)$; if $X = 1$, then take $K_j = \bar k_j$, otherwise pick the knots in $K_j$ according to any procedure which verifies (6), for instance one of the two procedures described above. The resulting prior will trivially satisfy both (6) and (7).

Condition (6) necessarily excludes some partitions from the support of the prior (and then also from the support of the posterior). As mentioned before, very few partitions will be excluded so long as $1/\delta(j)$ is a large enough power of $j$. It is nonetheless of interest to design a weaker alternative for condition (6). Condition (6') plays this role, in that it allows priors on $K$ which have any partition of $[0,1]$ into non-empty intervals in their support.

Assuming condition (6') instead of (6) consequently allows us to put positive mass on any vector of simple knots in a straightforward way: generate a Bernoulli random variable $X$ with success probability $1 - c_5 \exp(-c_4 n)$; if $X = 1$ take $K_j = \bar k_j$, equally spaced; if $X = 0$ then take an arbitrary $K_j$ (for example independent, uniformly distributed points on $[0,1]$). So long as we take $1/\delta(j) = j$ and $r \ge q$, conditions (6') and (7) are verified. This procedure, although simpler, does place little prior mass on knot vectors with inhomogeneous distributions.

An alternative, less degenerate prior, which verifies (6') and (7), can be obtained in the following way: given $J = j$, first generate a Bernoulli random variable $X_1$ with success probability $c_5 \exp(-c_4 n)$; if $X_1 = 1$ distribute the $(j - q)$ knots arbitrarily; if $X_1 = 0$ then generate another Bernoulli random variable $X_2$ with success probability $\exp(-j)$; if $X_2 = 1$ then take $(j - q)$ equally spaced knots $\bar k_j$; if $X_2 = 0$, then place the knots such that (6) is verified. This procedure should allow good control on the prior on the knots while not excluding any knot vectors.

Note that the priors described above which verify (4) through (8) do not depend on the sample size $n$, as prescribed by the Bayesian paradigm. Condition (6') is a weaker requirement than condition (6) but it will introduce a dependence on the sample size $n$ in the prior.
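The sparse-grid construction and the Bernoulli mixture with an equally spaced knot vector can be sketched in a few lines. This is a minimal illustration under our own assumptions: the function names are hypothetical, the grid is taken to be $\{\delta, 2\delta, \dots\}$ restricted to points at least $\delta$ away from 1, and the caller is responsible for choosing $\delta$ no larger than the spacing $1/(j - q + 1)$ of the equally spaced knots so that (6) holds.

```python
import math
import random


def equally_spaced_knots(j, q):
    """The j - q equally spaced inner knots in (0, 1)."""
    return [i / (j - q + 1) for i in range(1, j - q + 1)]


def sparse_grid_knots(j, q, delta, rng):
    """Pick j - q knots without replacement from the grid delta, 2*delta, ... inside (0, 1)."""
    n_grid = math.floor(1.0 / delta) - 1  # keeps the last grid point at least delta from 1
    if n_grid < j - q:
        raise ValueError("grid too coarse for j - q knots")
    idx = sorted(rng.sample(range(1, n_grid + 1), j - q))
    return [i * delta for i in idx]


def gaps(knots):
    """Lengths of the intervals of the partition of [0, 1] induced by the inner knots."""
    pts = [0.0] + list(knots) + [1.0]
    return [b - a for a, b in zip(pts, pts[1:])]


def mesh_size(knots):
    return max(gaps(knots))


def sparseness(knots):
    return min(gaps(knots))


def sample_knots(j, q, delta, c3, t3, rng):
    """With probability exp(-c3 j log^t3 j) use equally spaced knots, else the sparse grid."""
    p_equal = math.exp(-c3 * j * math.log(j) ** t3)
    if rng.random() < p_equal:
        return equally_spaced_knots(j, q)
    return sparse_grid_knots(j, q, delta, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    k = sample_knots(j=10, q=4, delta=0.01, c3=1.0, t3=1.0, rng=rng)
    print(k, mesh_size(k), sparseness(k))
```

Every knot vector produced this way has sparseness at least $\delta$ (up to floating-point rounding), while the equally spaced branch guarantees the lower bound on the mesh-size probability required in (7).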



6 Technical results 

In this section we collect some technical results. Lemmas 1 and 2 are needed to bound the entropy number of the sieves $\mathcal{S}_n$ in Theorem 1. Lemma 3 claims in essence that if some bounds on the range of the function $f_0$ are known, then this knowledge can be incorporated into the prior on the coefficients $\theta$.

Theorem 4.26 of Schumaker (2007) claims that if all the inner knots of a B-spline are simple, then the B-spline is continuous, uniformly over its support, with respect to its knots. In Lemma 2 we establish a slightly stronger result (a Lipschitz-type property): if we take two splines with the same coefficients in their respective B-spline basis, then the $L_\infty$ distance between the splines can be bounded by a multiple of the $\ell_\infty$ distance between the two sets of knots, as long as the sets of knots are sufficiently sparse. First, we present a preliminary lemma. Denote the $(r+1)$-th order divided difference of a function $h$ over the points $t_1, \dots, t_{r+1}$ as $[t_1, \dots, t_{r+1}]h = \big([t_2, \dots, t_{r+1}]h - [t_1, \dots, t_r]h\big) / (t_{r+1} - t_1)$, with $[t_i]h = h(t_i)$. If $t_1 = \dots = t_{r+1}$ then $[t_1, \dots, t_{r+1}]h = h^{(r)}(t_1)/r!$ for a function $h$ with enough derivatives at $t_1$.
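The recursive definition of the divided difference translates directly into code. The sketch below covers only the case where the first and last points of every sub-list are distinct (coincident points would require derivatives of $h$, as noted above); the function name is ours.

```python
def divided_difference(h, pts):
    """Divided difference [t_1, ..., t_{r+1}]h via the defining recursion."""
    if len(pts) == 1:
        return h(pts[0])
    if pts[-1] == pts[0]:
        raise ValueError("coincident end points require derivatives of h")
    upper = divided_difference(h, pts[1:])
    lower = divided_difference(h, pts[:-1])
    return (upper - lower) / (pts[-1] - pts[0])


if __name__ == "__main__":
    # For a cubic, the divided difference over four points is its leading coefficient.
    print(divided_difference(lambda y: y**3, [0.0, 1.0, 2.0, 3.0]))
    # Truncated power function h(y) = (x - y)_+^(q - 1) with x = 0.5 and q = 3.
    x, q = 0.5, 3
    print(divided_difference(lambda y: max(x - y, 0.0) ** (q - 1), [0.1, 0.4, 0.7]))
```

The recursion recomputes overlapping sub-differences and is therefore exponential in the number of points; this is harmless for the handful of knots in the support of a single B-spline.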

Lemma 1. Let $i \in \{1, \dots, r\}$, $r \ge 2$, $(k_1, \dots, k_{r+1}) \in (0,1)^{r+1}$. Assume $k_{v+1} - k_v \ge \delta > 0$ for $v = 1, \dots, i-1, i+1, \dots, r$ and $k_{i+1} - k_i = 0$. For fixed $x \in [0,1]$ take the function $h(y) = (x - y)_+^{q-1}$ with $y \in [0,1]$ and $q \ge 2$. Then the divided difference satisfies $|[k_1, \dots, k_{r+1}]h| \le 4/\delta^r$ for $x \ne k_i$.

Proof. Notice that $|h'(y)| = (q-1)(x-y)_+^{q-2} \le (q-1) \le 1/\delta$ for $x \ne y$, as $q \ge 2$ and 
thus $\delta \le k_2 - k_1 \le 1 \le$ ^pj. Next, if $v = i-1$, $|[k_{v+1}, k_{v+2}]h| = |h'(k_{v+1})| \le 1/\delta$; 
if $v \ne i-1$, $|[k_{v+1}, k_{v+2}]h| = |h(k_{v+2}) - h(k_{v+1})|/|k_{v+2} - k_{v+1}| \le 2/\delta$. We conclude 
$|[k_{v+1}, k_{v+2}]h| \le 2/\delta$ as long as $x \ne k_i$. 

For $j = 2, \ldots, r$, define $\gamma_j = \min_{v=1,\ldots,r+1-j} |k_{v+j} - k_v| \ge (j-1)\delta$. Now we make use 
of Theorem 2.56 from Schumaker (2007) and the previous bound: 

$$|[k_1, \ldots, k_{r+1}]h| \le \sum_{v=0}^{r-1} \binom{r-1}{v} \frac{|[k_{v+1}, k_{v+2}]h|}{\gamma_2 \cdots \gamma_r} \le \frac{2^r}{(r-1)!\,\delta^r} \le \frac{4}{\delta^r}$$

holds for all $x \ne k_i$. This completes the proof of the Lemma. 



□ 
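The constant 4 in Lemma 1 can be traced by hand. The following worked step is a sketch, assuming (as the proof indicates) that every first-order divided difference is at most $2/\delta$, that $\gamma_j \ge (j-1)\delta$ for $j = 2, \ldots, r$, and using the identity $\sum_{v=0}^{r-1}\binom{r-1}{v} = 2^{r-1}$:

```latex
\sum_{v=0}^{r-1}\binom{r-1}{v}\,\frac{2/\delta}{\gamma_2\cdots\gamma_r}
  \;\le\; \frac{2^{r-1}\cdot 2/\delta}{(r-1)!\,\delta^{\,r-1}}
  \;=\; \frac{2^{r}}{(r-1)!}\,\delta^{-r}
  \;\le\; 4\,\delta^{-r},
\qquad r \ge 2 .
```

The last inequality holds because $2^r/(r-1)!$ equals 4 for $r = 2$ and $r = 3$ and is non-increasing in $r$ from there on, since the ratio of consecutive terms is $2/r \le 1$.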



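Lemma 2. Let $\theta \in \mathbb{R}^j$ satisfy $\|\theta\|_\infty \le M$ and let $k, k' \in \mathcal{K}_j^{\delta} = \{k \in \mathcal{K}_j : m(k) \ge \delta\}$ be 
such that $\|k - k'\|_\infty \le \delta$. Then $\|s_{\theta,k} - s_{\theta,k'}\|_\infty \le L\|k - k'\|_\infty$ for $L = 4j(q+1)M\delta^{-(q+1)}$. 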

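Proof. Define $k^l = (k^l_1, \ldots, k^l_{j-q}) = (k'_1, \ldots, k'_l, k_{l+1}, \ldots, k_{j-q})$ for $l = 0, \ldots, j-q$, such 
that $k^0 = k$ and $k^{j-q} = k'$. We get 

$$\begin{aligned}
\|s_{\theta,k} - s_{\theta,k'}\|_\infty
 &= \Big\|\sum_{i=1}^{j} \theta_i B_i^{k} - \sum_{i=1}^{j} \theta_i B_i^{k'}\Big\|_\infty
  \le M \Big\|\sum_{i=1}^{j} \big|B_i^{k^0} - B_i^{k^{j-q}}\big|\Big\|_\infty \\
 &\le jM \max_{1 \le i \le j} \big\|B_i^{k^0} - B_i^{k^{j-q}}\big\|_\infty
  \le jM \max_{1 \le i \le j} \sum_{l=0}^{j-q-1} \big\|B_i^{k^l} - B_i^{k^{l+1}}\big\|_\infty \\
 &\le (q+1)\,jM \max_{1 \le i \le j}\ \max_{0 \le l \le j-q-1} \big\|B_i^{k^l} - B_i^{k^{l+1}}\big\|_\infty .
\end{aligned}$$

The last inequality follows from (pQ) and the fact that the inner knots of $B_i^{k^l}$ and $B_i^{k^{l+1}}$ 
differ only at the $(l+1)$-th entry. 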

Theorem 4.27 of Schumaker (2007) gives explicit expressions for the derivative of a 
B-spline with respect to one of its knots. These expressions are in terms of divided 
differences which satisfy the conditions of Lemma 1, so that applying Lemma 1 
with $r = q+1$ (the maximal number of knots in the support of a B-spline) yields that 
this derivative is bounded in absolute value by $4\delta^{-(q+1)}$, except at $x = k^l_{l+1}$, where it 
is not defined. Then, as $\|k^l - k^{l+1}\|_\infty \le \|k - k'\|_\infty$, we obtain that, for $x \ne k^l_{l+1}$, 
$l = 0, \ldots, j-q-1$, 

$$\big|B_i^{k^l}(x) - B_i^{k^{l+1}}(x)\big| \le \big|k^l_{l+1} - k^{l+1}_{l+1}\big| \sup_{k^l_{l+1} \in (0,1)} \left|\frac{\partial B_i^{k^l}(x)}{\partial k^l_{l+1}}\right| \le 4\,\delta^{-(q+1)}\,\|k - k'\|_\infty .$$



Since splines are continuous for all $q > 1$, so is $s_{\theta,k} - s_{\theta,k'}$ and we conclude that the same 
bound must also hold for $x = k^l_{l+1}$. Combining the above two relations concludes the 
proof. 

□ 
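The Lipschitz-type bound of Lemma 2 can be sanity-checked numerically. The sketch below is illustrative only and makes several assumptions that are not taken from the paper: B-splines of order $q$ on $[0,1]$ are evaluated with the standard Cox-de Boor recursion, the boundary knots 0 and 1 are repeated $q$ times, $m(k)$ is read as the smallest gap between consecutive knots (boundaries included), the constant is taken to be $L = 4j(q+1)M\delta^{-(q+1)}$ as the constants in the proof suggest, and the supremum is approximated on a finite grid. All function names and numerical values are hypothetical.

```python
def bspline_basis(i, q, t, x):
    """i-th B-spline of order q (degree q - 1) on knot vector t, via Cox-de Boor."""
    if q == 1:
        # Indicator of the half-open knot interval [t_i, t_{i+1})
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + q - 1] > t[i]:
        left = (x - t[i]) / (t[i + q - 1] - t[i]) * bspline_basis(i, q - 1, t, x)
    if t[i + q] > t[i + 1]:
        right = (t[i + q] - x) / (t[i + q] - t[i + 1]) * bspline_basis(i + 1, q - 1, t, x)
    return left + right


def spline(theta, inner, q, x):
    """Spline with coefficients theta and the given inner knots, evaluated at x in [0, 1)."""
    t = [0.0] * q + list(inner) + [1.0] * q   # assumed boundary-knot convention
    return sum(th * bspline_basis(i, q, t, x) for i, th in enumerate(theta))


def sup_distance(theta, k1, k2, q, n_grid=400):
    """Grid approximation of the sup-norm distance (the grid omits x = 1)."""
    grid = [m / n_grid for m in range(n_grid)]
    return max(abs(spline(theta, k1, q, x) - spline(theta, k2, q, x)) for x in grid)


q = 3
k = [0.2, 0.4, 0.6, 0.8]
k_prime = [0.22, 0.39, 0.63, 0.8]
theta = [1.0, -0.5, 0.8, -1.0, 0.3, 0.9, -0.2]   # j = q + len(k) = 7 coefficients
j, M, delta = len(theta), 1.0, 0.15              # both knot sets have all gaps >= delta

knot_dist = max(abs(a - b) for a, b in zip(k, k_prime))
L = 4 * j * (q + 1) * M * delta ** (-(q + 1))
lhs = sup_distance(theta, k, k_prime, q)
print(lhs, L * knot_dist)
```

In examples of this kind the observed distance is typically far below $L\|k - k'\|_\infty$, which is consistent with the lemma giving a worst-case constant that is sufficient for the entropy calculation rather than a sharp one.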

The properties of B-splines allow us to relate the range of the coefficients of the approxi- 
mating spline to the range of the approximated function. The following lemma generalizes 
Lemma 1 of Shen and Ghosal (2012) to non-equally spaced knots. 
Lemma 3. Let $f \in \mathcal{F}_\alpha$ (so that (3) holds), $a < b$, $\varepsilon > 0$. Assume that $f(x) \in [a + \varepsilon, b - \varepsilon]$ 
for all $x \in [0,1]$. Then there exists a positive constant $\delta = \delta(\mathcal{F}_\alpha, \varepsilon)$ such that for any 
$k \in \mathcal{K}_j$, $j > q$, such that $M(k) < \delta$, the coefficients $\theta$ of the approximating spline $s_{\theta,k}$ in 
(3) can be taken to be contained in $(a, b)$. 

Proof. Fix $q$, $j$ and inner knots $k$, assume $I = [a, b]$, $a < b$ and $a + \varepsilon \le f \le b - \varepsilon$, for some 
$\varepsilon > 0$. 



We use results from Section 4.6 of Schumaker (2007) on dual bases of B-splines. If 
$B_1^k, \ldots, B_j^k$ is the B-spline basis associated with the inner knots $k$, then there exists a 
dual basis $\lambda_1, \ldots, \lambda_j$ of linear functionals such that, for each $i, r = 1, \ldots, j$, $\lambda_r B_i^k = 1$ 
if $i = r$ and is 0 otherwise. As a consequence, we obtain that $\lambda_i s_{\theta,k} = \theta_i$, and since 
$\sum_{i=1}^{j} B_i^k(x) = 1$, it follows that $\lambda_i c = c$ for any constant $c$ and all $i = 1, \ldots, j$. This dual 
basis is not necessarily unique and, according to Theorem 4.41 from Schumaker (2007), 
can be taken such that $|\lambda_i f| \le C_1 \sup_{x \in I_i} |f(x)|$, where $I_i$ represents the support of $B_i^k$ 
and the constant $C_1$ depends only on $q$. Each $I_i$ consists of at most $q$ adjacent intervals in the 
partition induced by $k$ and thus the length of $I_i$ is bounded by $qM(k)$. 
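Let $s_{\theta,k}$ be such that (3) is fulfilled for $f$. Then for any constant $c$ 

$$\begin{aligned}
|\theta_i - c| &= |\lambda_i s_{\theta,k} - \lambda_i f + \lambda_i f - c| \le |\lambda_i(s_{\theta,k} - f)| + |\lambda_i(f - c)| \\
 &\le C_1 C_f M^{\alpha}(k) + C_1 \sup_{x \in I_i} |f(x) - c| .
\end{aligned}$$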

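Take $c = \inf_{x \in I_i} f(x)$ and recall that $f \in \mathcal{F}_\alpha \subset C(\kappa_\alpha, L_\alpha)$. Using the Lipschitz property, we 
derive that $\sup_{x \in I_i} |f(x) - c| = \sup_{x \in I_i} f(x) - \inf_{x \in I_i} f(x) \le L_\alpha (qM(k))^{\kappa_\alpha}$ and therefore 

$$\Big|\theta_i - \inf_{x \in I_i} f(x)\Big| \le C_1 C_f M^{\alpha}(k) + C_1 L_\alpha (qM(k))^{\kappa_\alpha} \le C_2 M^{\alpha \wedge \kappa_\alpha}(k) .$$

In the same way, if we take $c = \sup_{x \in I_i} f(x)$, we derive that $\sup_{x \in I_i} |f(x) - c| \le L_\alpha (qM(k))^{\kappa_\alpha}$ 
and thus $|\theta_i - \sup_{x \in I_i} f(x)| \le C_2 M^{\alpha \wedge \kappa_\alpha}(k)$. 

Now for $\delta = (\varepsilon/(2C_2))^{1/(\alpha \wedge \kappa_\alpha)}$ conclude that if $M(k) < \delta$, then $\theta_i \ge \inf_{x \in I_i} f(x) - 
C_2 M^{\alpha \wedge \kappa_\alpha}(k) > \inf_{x \in I_i} f(x) - \varepsilon/2 > a$. For the same choice of $\delta$ we have $\theta_i \le \sup_{x \in I_i} f(x) + 
C_2 M^{\alpha \wedge \kappa_\alpha}(k) < \sup_{x \in I_i} f(x) + \varepsilon/2 < b$. 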

□ 
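The role of the particular choice of $\delta$ in the last step of the proof of Lemma 3 can be spelled out explicitly. The following is a short worked check, assuming only the quantities already defined in the proof ($C_2$, $\alpha \wedge \kappa_\alpha$, $\varepsilon$) and the hypothesis $f \ge a + \varepsilon$:

```latex
M(k) < \delta = \Big(\frac{\varepsilon}{2C_2}\Big)^{1/(\alpha\wedge\kappa_\alpha)}
\;\Longrightarrow\;
C_2\,M^{\alpha\wedge\kappa_\alpha}(k) < C_2\,\delta^{\alpha\wedge\kappa_\alpha}
  = C_2\cdot\frac{\varepsilon}{2C_2} = \frac{\varepsilon}{2},
\qquad
\inf_{x\in I_i} f(x) - \frac{\varepsilon}{2} \;\ge\; a + \varepsilon - \frac{\varepsilon}{2}
  \;=\; a + \frac{\varepsilon}{2} \;>\; a .
```

The upper bound by $b$ follows symmetrically from $f \le b - \varepsilon$.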